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Large-scale statistical analysis of data sets associated with genome 
sequences plays an important role in modern biology. A key compo- 
nent of such statistical analyses is the computation of p-values and 
confidence bounds for statistics defined on the genome. Currently 
such computation is commonly achieved through ad hoc simulation 
measures. The method of randomization, which is at the heart of 
these simulation procedures, can significantly affect the resulting sta- 
tistical conclusions. Most simulation schemes introduce a variety of 
hidden assumptions regarding the nature of the randomness in the 
data, resulting in a failure to capture biologically meaningful relation- 
ships. To address the need for a method of assessing the significance 
of observations within large scale genomic studies, where there of- 
ten exists a complex dependency structure between observations, we 
propose a unified solution built upon a data subsampling approach. 
We propose a piecewise stationary model for genome sequences and 
show that the subsampling approach gives correct answers under this 
model. We illustrate the method on three simulation studies and two 
real data examples. 

1. Introduction. 

1.1. Background. This paper grew out of a number of examples arising 
in data coming from tiie Encyclopedia of DNA Elements (ENCODE) Pi- 
lot Project [Birney et al. (2007)], which is composed of multiple, diverse 
experiments performed on a targeted 1% of the human genome. Computa- 
tional analyses of this data are aimed at revealing new insights about how 
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the information coded in the DNA blueprint is turned into functioning sys- 
tems in the hving cell. Variations of some of the methods described here 
have been applied at various places in that paper, as well as in Margulies 
et al. (2007), for assessing significance and computing confidence bounds for 
statistics that operate along a genomic sequence. The background of these 
methods is described in cookbook form in the supplements to those papers, 
and it is the goal of this paper to present them rigorously and to develop 
some necessary refinements. 

Essentially, we will argue that, in making inference about statistics com- 
puted from "large" stretches of the genome, in the absence of real knowledge 
about the evolutionary path which led to the genome in question, the best 
we can do is to model the genome by a piecewise stationary ergodic random 
process. The variables of this process can be base pair composition or some 
other local features, such as various annotated functional elements. 

In the purely stationary case some of the types of questions that we will 
address, such as tests for independence of point processes, confidence bounds 
for expectations of local functions and goodness of fit of models, have been 
considered extensively. However, inference for piecewise stationary models 
appears not to have been investigated. With the advent of enormous amounts 
of genomic data, all sorts of inferential questions have arisen. The proposed 
model may be the only truly nonparametric approach to the genome, al- 
though, just as in ordinary nonparametric statistics, there are many possible 
ways of carrying out inference. 

Our methods are based on a development of the resampling schemes of 
Politis and Romano (1994), Politis, Romano and Wolf (1999) and the block 
bootstrap methods of Kiinsch (1989). As we shall see, in many situations, 
Gaussian approximations can replace these schemes. But in these situations, 
as with the ordinary bootstrap, we believe that a subsampling approach is 
valuable for the following reasons: 

• Letting the computer do the approximation is much easier. 

• Some statistics, such as tests of the Kolmogorov-Smirnov type, are func- 
tions of stochastic processes to which a joint Gaussian approximation 
applies. Then, limiting distributions can only be computed by simulation. 

• The bootstrap distributions of our statistics show us whether the approx- 
imate Gaussianity we have invoked for the "true" distribution of these 
statistics is in fact warranted. This visual confirmation is invaluable. 

This paper is organized as follows. We begin with some concrete examples 
from the ENCODE data as well as other types of genomic data in Section 1.2, 
and proceed with a motivated description of our model in Section 2. Our 
methods are discussed both qualitatively and mathematically in Sections 
3 and 4. Section 5 contains results from simulation studies and real data 
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analysis. Proofs of theorems stated in Sections 3 and 4 can be found in the 
supplemental article [Bickel et al. (2010)]. 

The statistics and methods discussed in this paper have been imple- 
mented in several computing languages and are available for download at 
http://www.encodestatistics.org/. Each of these implementations runs 
in nlog(n) time, where n is the number of instances of the more frequent 
feature. On a desktop PC (Intel Core Duo 3 GHz and 2 Gb RAM) the 
Python version takes over 1000 samples per second for features on the order 
of 10^ instances. 

1.2. Motivating examples. We start with several fundamental questions 
that arise in genomic studies. 

• Association of functional elements in genomes. In genomic analyses, a 
natural quantity of interest is the association among different functional 
sites/features annotated along the DNA sequence. Its biological motiva- 
tion comes from the common belief that significant physical overlapping or 
proximity of functional sites in the genome suggests biological constraints 
or relationships. In the ENCODE project, to understand the possible func- 
tional roles of the evolutionarily constrained sequences that are conserved 
across multiple species, overlap between the constrained sequences and 
several experimental annotations, such as 5'UTR, RxFrags, pseudogenes, 
and coding sequences (CDSs), have been evaluated using the method dis- 
cussed in this paper. It was found that the overlap of most experimental 
annotations with the constrained sequences are significantly different from 
random [Birney et al. (2007)]. An illustrative example from The ENCODE 
Project [Birney et al. (2007)] is detailed in Section 5.1. 

• Cooperativity between transcription factor binding sites. In some situ- 
ations, there is interest to study the associations between neighboring 
functional sites that do not necessarily overlap. For instance, it is known 
that transcription factors often work cooperatively and their binding sites 
(TFBS) tend to occur in clusters [Zhang et al. (2006)]. Consequently, 
methods for identifying interacting transcription factors usually involve 
evaluating the significance of co-occurrences of their binding sites in a 
local genomic region [Zhou and Wong (2004); Das, Banerjee and Zhang 
(2004); Yu, Yoo and Greenwald (2004); Huang et al. (2004); Kato et al. 
(2004); Gupta and Liu (2005)]. This problem has the same formulation as 
the above ENCODE examples given a functional site defined as follows: 
for a TFBS of length I at position i, we define the region (i — m,i + l + m) 
as a functional site. Then two overlapping functional sites are equivalent 
to two neighboring TFBSs with interdistance less than 2m, and the meth- 
ods discussed in this paper for evaluating the significance of overlapping 
functional features can be applied. We leave this and related applications 
which involve considering statistics of the K-S type to a later paper. 
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• Correlating DNA copy number with genomic content. Recent technology 
has made it possible to assay DNA copy number variation at a very fine 
scale along the genome [for review, see Carter (2007)]. Many studies, for 
example, Redon et al. (2006), have shown that such variation in DNA 
copy number is a common type of polymorphism in the human genome. 
To what extent do these regions of copy number changes overlap with 
known genomic features, such as coding sequences? Redon et al. per- 
formed such an analysis and argued that copy number variant regions 
have a significant paucity for coding regions. The p- values supporting this 
claim were based on random permutations of the start locations of the 
variant segments. This assumes uniformity and stationarity of the copy 
number variants. However, CNVs do not occur at random and are often 
clustered in regions of the genome containing segmental duplications. The 
methods discussed in this paper for evaluating the significance of overlap- 
ping features, which assume neither uniformity nor stationarity, can again 
be applied to this problem. Actually, the results from our method suggest 
a different conclusion on this problem (see Section 5.5). 

As we have seen in these examples, a common question asked in many 
applications is the following: Given the position vectors of two features in 
the genome, for example, "conservation between species" and "transcription 
start sites," and a measure of relatedness between features, for example, 
base or region percentage overlap, how significant is the observed value of 
the measure? How does it compare with that which might be observed "at 
random?" 

The essential challenge in the statistical formulation of this problem is 
the appropriate modeling of randomness of the genome, since we observe 
only one of the multitudes of possible genomes that evolution might have 
produced for our and other species. 

How have such questions been answered previously? Existing methods em- 
ploy varied ways to simulate the locations of features within genomes, but 
all center around the uniformity assumption of the features' start positions: 
The features must occur homogeneously in the studied genome region, for 
example, Blakesley et al. (2004) and Redon et al. (2006). This assumption 
ignores the natural clumping of features as well as the nonstationarity of 
genome sequences. Clumping of features is quite common along the genome 
due to either the feature's own characteristic, for example, transcription 
factor binding sites (TFBSs) tend to occur in clusters, or the genome's evo- 
lutionary constraints, for example, conserved elements are often found in 
dense conservation neighborhoods. Ignoring these natural properties could 
result in misleading conclusions. 

In this paper we suggest a piecewise stationary model for the genome 
(see Section 2) and, based on it, propose a method to infer the relationships 
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between features which we view as "nonparametric" as possible (see Sections 
4.2 and 4.4). The model is based on assumptions which we demonstrate in 
real data examples to be plausible. 

2. The piecewise stationary model. 

2.1. Genomic motivation. We postulate the following for the observed 
genomes or genomic regions: 

• They can be thought of as a concatenation of a number of regions, each 
of which is homogenous in a way we describe below. 

• Features that are located very far from each other on the average have 
little to do with each other. 

• The number of such homogeneous regions is small compared to the total 
length of the observed genome. 

These assumptions, which form the underpinning of our block stationary 
model for genomic features, are motivated by earlier studies of DNA se- 
quences, which show that there are global shifts in base composition, but 
that certain sequence characteristics are locally unchanging. One such char- 
acteristic is the GC content. Bernardi et al. (1985) coined the term "iso- 
chore" to denote large segments (of length greater than 300 Kb) that have 
fairly homogeneous base composition and, especially, constant GC compo- 
sition. Even earlier, evidence of segmental DNA structure can be found 
in chromosomal banding in polytene chromosomes in drosophila, visible 
through the microscope, that result from underlying physical and chemi- 
cal structure. These banding patterns are stable enough to be used for the 
identification of chromosomes and for genetic mapping, and are physical 
evidence for a block stationarity model for the GC content of the genome. 

The experimental evidence for segmental genome structure and the in- 
creasing availability of DNA sequence data have inspired attempts to compu- 
tationally segment DNA into statistically homogeneous regions. The paper 
by Braun and Miiller (1998) offers a review of statistical methods developed 
for detecting and modeling the inhomogeneity in DNA sequences. There 
have been many attempts to segment DNA sequences by both base com- 
position [Fu and Curnow (1990); Churchih (1989, 1992); Li et al. (2002)] 
and chemical characteristics [Li et al. (1998)]. Most of these computational 
studies concluded that a model that assumes block-wise stationarity gives a 
significantly better fit to the data than stationary models [see, for example, 
the conclusions of two very different studies by Fickett, Torney and Wolf 
(1992) and Li et al. (1998)]. 

A subtle issue in the definition of "homogeneity" is the scale at which 
the genome is being analyzed. Inhomogeneity at the kilobase resolution, for 
example, might be "smoothed out" in an analysis at the megabase level. 
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The level of resolution is a modeling issue that must be considered carefully 
with the goal of the analysis in mind. 

Implicit in our formulation is an "ergodic" hypothesis. We want proba- 
bilities to refer to the population of potential genomes. We assume that the 
statistics of the genome we have mimic those of the population of genomes. 
This is entirely analogous to the ergodic hypothesis that long term time 
averages agree with space averages for trajectories of dynamic systems. 

2.2. Mathematical formulation. In mathematical terms, the block sta- 
tionarity model assumes that we observe a sequence of random variables 
{Xi, . . . , Xn} positioned linearly along the genomic region of interest. Xi^,k = 
1, . . . , n, may be base composition, or some other measurable feature. We as- 
sume that there exist integers r = r^") = (tq, . . . , tjj), where = tq < ti < 
■ ■ ■ < Tif = n, such that the collections of variables, {X^-^ , . . . , X^-^^^^}, are sep- 
arately stationary for each i = 0, . . . ,U — 1. We let ni = Ti — rj_i be the length 
of the ith region, and let there be U such regions in total. For convenience, 
we introduce the mapping 

tt:{1, . . . ,n} ^ {{i,j) -.1 <i <U,1 < j <n,} 

which relates the relabeled sequence, {Xij :1 < i < U,l < j < rii}, to the 
original sequence We write vr = (7ri,7r2) with 7r{k) = {i,j) if 

and only if k = Ti+ j. We will use the notation Xij and Xk interchangeably 
throughout this paper. 

For any ki, k2, let be the fj-field generated by X^j^, . . . , X^^. Define 
m{k) to be the standard Rosenblatt mixing number [Dedecker et al. (2007)], 

m{k) = snp{\F{AB) - F{A)F{B)\ ■.A€T[,B€ T'l^k: l<l<n-k]. 

Then, assumptions 1-3 stated in Section 2.1 translate to the following: 

Al. The sequence {^i, . . . ,Xn} is piecewise stationary. That is, {Xij : 1 < 

J < ?^^} is a stationary sequence for i = 1, . . . , [/. 
A2. There exists constants c and /? > such that m{k) < ck~^ for all k. 
A3. U/n^O. 

An immediate and important consequence of A1-A3 is that, for any fixed 
smah k, if we define Wi = {Xi, . . . , Xk),W2 = (Xfc+i, • • • , X2k), ■■.,Wm = 
{Xn-k+i, ■ ■ ■ -.Xn), where m = n/k, then {Wi,...,Wm} also obey A1-A3. 
This is useful, for example, in the region overlap example considered in the 
next section. 

The remarkable feature of these assumptions, which are more general to 
our knowledge than any made heretofore in this context, is that they still 
allow us to conduct most of the statistical inference of interest. Not surpris- 
ingly, these assumptions lead to more conservative estimates of significance 
than any of the previous methods. 



SUBSAMPLING METHODS FOR GENOMIC INFERENCE 



7 



3. Vector linear statistics and Gaussian approximation. We study the 
distribution of a class of vector linear statistics of interest under the above 
piecewise stationary model. As an illustration, we consider the ENCODE 
data examples, and suppose that we are interested in base pair overlap 
between features A and B. We can represent base pair overlap by defining 

/fc = 1, if position k belongs to feature A and otherwise, 
Jfc = 1, if position k belongs to feature B and otherwise. 

We can then define X). = IkJk to be the indicator that position k belongs 
to both features A and B. Then, for the n = 30 megabases of the ENCODE 
regions, the mean base pair overlap is equal to 

n 

X = ^Xk/n. 

k=l 

Another biologically interesting statistic is the (asymmetric) region over- 
lap, defined as follows: suppose the consecutive feature stretches are Ti, . . . , 
with lengths ri , . . . , Tq, , and the corresponding nonfeature stretches 5i , . . . , 5^ 
with lengths pi,. . . ,pp. We assume here that the initial and final stretches 
consist of one feature and one nonfeature stretch. The complementary situa- 
tion, when both initial and final stretches are of the same type, is dealt with 
similarly. Without loss of generality, suppose the initial stretch is nonfeature. 
Then, Si = {l,...,pi}, Ti = {pi + 1, . . . , pi + n}, 52 = {/Oi + n + 1, . . . , + 
Ti + P2}) etc. Using Ik, Jk as indicators of feature identity, we define the (un- 
normalized) region overlap of feature A stretches with feature B stretches as 
h Ylt=i where Vt = I - Uk^TA.S^ ~ "^k), where Ta,i, ■ ■ ■ ,TA,a denote the 
feature A stretches. This statistic is not linear in terms of functions of single 
basepairs, but is linear in functions of blocks of feature B. These blocks are 
of random sizes, but are consistent with our hypothesis of piecewise station- 
arity that, except for end effects due to feature instances crossing segment 
boundaries, the Vt are also stationary. If the lengths Ti,...,Ta are negligi- 
ble compared to n and a is of the order of n, we can expect the mixing 
hypothesis to remain valid. 

In general, we focus our attention on statistics that can be expressed as a 
function of the mean of g(Xj), where g is some well behaved d-dimensional 
vector function to be characterized in later sections. By the fiexible definition 
of g, this encompasses a wide class of situations. 

First, we consider vector linear statistics of the form 

n 

T„(X)=n^i^g(Xfc). 

k=l 
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We introduce the following notation: 

u 



i=l 



where 



fi = rii/n 



and 

(3.1) S„ = Yar{n^/^Tn) = ^ /^^(n/^), 



u 



i=l 



where 



m , 

a(m) = ao + 2^C7,,(l 



1) 



and 

(3.2) Qo^Varg(Xi), ^ Cov(g(Xa), g(X,(;+i))). 

In Theorem 3.1 below, we show asymptotic Gaussianity of T„ given a few 
more technical assumptions: 

A4. iEi:n,<i?^*^0 for alU<oo. 
A5. Vi, |g|oo < C < oo. 

A6. < eo ^ ll^nll ^ ^0 ^) for all n, where || • || is a matrix norm. 

In particular, A4 implies that the contribution of "small regions" to the 
overall statistic must not be too large. 

Theorem 3.1. Under conditions A1-A6, 

(3.3) nV2s;i/2(T„_^)^AA(0,I), 
where I is the d x d identity. 

The proof of the theorem is in the supplemental article [Bickel et al. 
(2010)]. If we have estimates r of r which are consistent in a suitably uniform 
sense, then estimates of Ca, Ci{m) using t in place of r are also consistent. 
However, simply plugging these estimates into (3.1) does not yield consistent 
estimates of o"^ if our approach were to compute confidence intervals by 
Gaussian approximation. This is well known for the stationary case. Some 
regularization is necessary. We do not pursue this direction but prefer to 
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approach the inference problem from a resamphng point of view — see next 
section. 

In many cases, the statistics of interest are not hnear. For example, in 
the analysis of the ENCODE data a more informative statistic is the %bp 
overlap defined as 

(3.4) 
where 

n 
k=l 

is the total base count of feature A. 

More conceptually, the region overlap is 

(3.5) 

^ k=i 

where Wj = Yl^=i ^k-ii^ — Ik), the number of feature A instances. 

A standard delta method computation shows that the standard error of 
B can be approximated as follows: Let n{D) and n{X) be respectively the 
expectation of D and X . Then, 

X ti{X) ^ X - fi{X) (D-^(D)) 

m(d) 



and, hence, we can approximate ^ by a Gaussian variable with mean ^^^^ 
and variance 



where a'^{B), a'^{X), cr^(-D) are the corresponding variances and Co\{X ,D) 
denotes the covariance. In doing inference, we can use the approximate Gaus- 
sianity of B with (B) estimated using the above formula with regularized 
sample moments replacing the true moments. 

We also note that goodness of fit or equality of population test statistics, 
such as Kolmogorov-Smirnov and many others, can be viewed as functions 
of empirical distributions, which themselves are infinite-dimensional linear 
statistics, and we expect, but have not proved, that the methods discussed 
in this paper and the underlying theories apply to those cases as well, under 
suitable assumptions. 



10 



p. J. BICKEL ET AL. 



4. Subsampling based methods. Here we propose a subsampling based 
approach, in particular, a combined segmentation-block subsampling method 
to conduct statistical inference under the piecewise stationary model, which 
we call "segmented block subsampling." In our method, the segmentation 
parameters governing scale are chosen first and then the size of the sub- 
sample is chosen based on stability criteria. The segmentation procedure, 
as we discussed, is motivated by the heterogeneity of large-scale genomic 
sequences. The block subsampling approach takes into account the local 
genomic structure, such as natural clumping of features, when conducting 
statistical inference. We explicitly demonstrate the advantages of using seg- 
mentation and block subsampling by simulation studies in Section 5. 

4.1. Stationary block subsampling. Below we first review the results re- 
lated to the stationary block bootstrap method in a homogeneous region 
{U = 1), and then show how the method breaks down when it is applied to 
a piecewise stationary sequence {U > 1). 

4.1.1. Review of results for the case of U = 1. For completeness, we re- 
call the following basic algorithm of Politis and Romano (1994) to obtain 
an estimate of the distribution of the statistic T„(Xi, . . . , X„) under the 
assumption that the sequence Xi, . . . ,X„ is stationary (i.e., U = 1). 

Algorithm 4.1. (a) Given L<^n choose a number N uniformly at 
random from {l,...,n — L}. 

(b) Given the statistic T, as above, compute 



(c) Repeat B times with replacement to obtain T^^, . ■ . ,T^*lb- 

(d) Estimate the distribution of y/n{Tn — n) by the empirical distribution 



in probability for some constant S if (3.3) holds and if L — )■ oo, L/n^O. As 
usual, convergence of C*^ in law in probability simply means that if p is any 

metric for weak convergence on R , then p(£g,£)— )-0. 

Since all variables we deal with are in L2, we take p to be the Mallows 
metric, 

pIj{F, G) = min{£;p(T^ -Vf:P such that W^F.V^G). 
Useful properties of pM are as follows: 



TL(-^Ar+l, . . .,Xm+l) = T^i- 



C*n of 




By Theorem 4.2.1 of Pohtis, Romano and Wolf (1999) 



(4.1) 




SUBSAMPLING METHODS FOR GENOMIC INFERENCE 



11 



(a) plj{ETTiFi, SvrjGi) < ll'Kip\j{Fi, d) for all vrj > 0, Evrj = 1 and 

(b) If F = Fi * • • • * Fm, G = Gi* • • • * Gm, that is, F and G are distribu- 
tions of sums of m independent variables, then p\[{F, G) < Xll^i p\i{PiiGi)- 

For convenience, when no confusion is possible, we will write /9Af(VF, y) for 
Pm{F, G) for random variables W ^ F, V ^ G. 

4.1.2. Performance of the block subsampling method in the piecewise sta- 
tionary model when U > 1. We turn to the analogue of Theorem 4.2.1 
in Politis, Romano and Wolf (1999) for U > 1. We consider a vector linear 
statistic, for which the true distribution was described in Section 3. Here, we 
ask how Algorithm 4.1, which assumes stationarity, performs in this nonsta- 
tionary context. We show that, in general, it does not give correct confidence 
bounds but is conservative, sometimes exceedingly so. The results depend 
on L, the subsample size, which is a crucial parameter in Algorithm 4.1. 
We sketch these issues in Theorem 4.2 below, for simplicity, letting g be the 
one-dimensional identity function g{x) = x. Let 

1=1 

rii n U 

j=l k=l i=l 

Also let 

n* = Cardinality of Si = {k : k £ [N, N + L],ni{k) = i} 

and 

X: = lin*^0)Y,{X^J■■j€Si}/n*, 

j 

u _ * 
Xl = Y,fiX* where = 

i=l 

We introduce one assumption that is obviously needed for any analysis of 
the block or segmented resampling bootstraps: 

A7. L^oo, 

and two other assumptions which are used in different parts of Theorem 4.2 
but not in the rest of the paper, and are thus given a different numbering: 



Bl. L/n^O. 
B2. (Lf/)/n^O. 
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Theorem 4.2. Let Cn he the distribution which assigns mass fi to — 
A*)) 1 < "i < f^, o,nd write Ci for Ci{nfi). Suppose assumptions A1-A5 and 
A7 hold: 

(i) // B2 holds, pMiXl-X,Cn)^0. 

(ii) If 



(4.2) 



1=1 



and Bl holds, then 



PM 



u 



L(X2-X),j;/,AA(0,C,) 



i=l 



(iii) // (4.2) and Bl hold and 

u 

(4.3) J^/^l(|S„-a|>e)^0 



i=l 



for all e > 0, then 



PM 



;^/I(x2-x),AA(o,s„)) Ao. 



The implications of Theorem 4.2 are as follows. If equation (4.2) does 
not hold, then X2 — X does not converge in law at scale L~^/^ so that it 
does not reflect the behavior of L^^'^{Xl — p) at all. This is a consequence 
of inhomogeneity of the segment means. Evidently in this case, confidence 
intervals of the percentile type for p, [X + c,„(q), X + c„(l — a)], where c„(a) 
is the a quantile of the distribution of — X, will have coverage probability 
tending to 1, since Cniot) and Cyi{\. — oi) do not converge to at rate L ^''^ 
as they should. If B2 does not hold, we have to consider the possibility 
that \N ,N ^ L\ covers Kj^ consecutive segments, whose total length is o(n), 
such that the average over all such blocks is close to p. However, in the 
absence of a condition such as (4.2) or mutual cancellation of p*^ the scale 
of X*^ will be larger than L^^l"^ . These issues will be clarified by the proof 
of Theorem 4.2 in the supplemental article [Bickel et al. (2010)]. We note 
also that (4.2) can be weakened to requiring that the mean of blocks of 
consecutive segments whose total length is small compared to n be close to 
p to order o(L~^/^). But our statement makes the issues clear. Finally, note 
that B2 holds automatically if the number of segments \J is bounded and if 
Bl holds. 

If (4.2) does hold but (4.3) does not, then \fTj{X1^ — X) converges in law 
to the Gaussian mixture Y^i=\ C'j). The mixture of Gaussians is more 
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dispersed in a rough sense than a Gaussian with the same variance, which 
is 

u 

i=l 

see Andrews and Mallows (1974). Especially note that, if W has the mixture 
distribution and V is the Gaussian variable with the same variance, then 

by Jensen's inequality. This suggests that the tail probabilities will also 
be overestimated. The overdispersion here, which leads to conservativeness 
that is not as extreme as in case (i), is due to inequality of the variances 
from segment to segment. Finally, if (4.3) holds, then the segments have 
essentially the same mean and variance and stationary block subsampling 
works. 

A mark of either (4.2) or (4.3) failing is a lack of Gaussianity in the 
distribution of — X. This was in fact observed at some scales in the 
ENCODE project, which led us to crudely segment on biological grounds 
with reasonable success. However, the correct solution, which we now present 
in this paper, is to estimate the segmentation and appropriately adjust the 
subsampling procedure. 

4.2. A segmentation based block subsampling method. We saw in the pre- 
vious section that the naive block subsampling method that was designed for 
the stationary case breaks down when the sequence follows a piecewise sta- 
tionary model. We propose a stratified block subsampling strategy, which 
stratifies the subsample based on a "good" segmentation of the sequence 
which is estimated from the data. We first state the block subsampling 
method, and then in Section 4.2.3 give minimal conditions on the estimated 
segmentation for its consistency. In Section 4.3 we discuss possible segmen- 
tation methods. 

4.2.1. Description of algorithm. Assume that we are given a segmenta- 
tion t = (0 = toi ^1) • • • ! tm+i = n)., where m is the number of regions in t. 
Assume that the total size L of the subsample is pre-chosen. We define a 
stratified block subsampling scheme as follows. 

Algorithm 4.3. For i = 1, . . . ,m, let Aj = Aj(t) = \{ti - tj_i)L/n] . We 
use the notation Xi-^i to denote the block of length / starting at i: 

Then, for each subsample. 
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Draw integers N = {A^^i, . . . , Nm}, with Ni chosen uniformly from {ti-i + 
1, . . . ,ti — Xi{t) + 1}, and let 

X* = {XI,. . . = (Xj^^-xiit),- ■ ■ l-'^Af„;A„(^))• 
Repeat the above B times to obtain B subsamples: X*'^, . . . ,X*'^. 

To obtain a confidence interval for //, we assume that the statistic 
has approximately a N(fj,, Tin/n) distribution as in the previous section. For 
each subsample drawn as described in Algorithm 4.3, compute the statistic 
rj,*J} _ •j^*£f>^^-^ _ Form the sampling estimate of variance, 

(4.4) S„ ^ - Y,{Tf - Tl)'{T^t - m 

6=1 

where = Ylb=i'^*L' I ^ ■ '^^^ proceed to estimate the confidence 

interval for T„ in standard ways. For example, in the univariate case where 
cj^j = S„ is a scalar: 

(a) Use X lb -2;i_a/2"^) where Zi_a/2 is the 1 — a/2th quantile of A^(0, 1), 
for a 1 — a confidence interval. 

(b) Efron's percentile method: Let X^*-^^ < • • • < X'^b) ordered X*'^ , 
then use [-^*[5q,/2])'"^(*[_b{i-o/2)])] as a 1 — a confidence interval. 

(c) Use a Studentized interval [Efron (1981)] or Efron's (1987) BCA 
method; see Hall (1992) for an extensive discussion. 

Although the theory for (c) giving the best coverage approximation has 
not been written down, as it has been for the ordinary bootstrap, we expect 
it to continue to hold. Evidently, these approaches can be applied not only to 
vector linear statistics like T„ but also to smooth functions of vector linear 
statistics. 

This algorithm assumes a given segmentation t, which should be set to 
some good estimate t^"^ = {0 = fo, fi, . . . , fm = n} of the true change points 
T^^\ In order for the algorithm to perform well, a good segmentation is crit- 
ical unless the sequence is already reasonably homogeneous. In Section 4.2.2 
below we state the result that the algorithm is consistent if the given seg- 
mentation equals the true changepoints. Then, in Section 4.2.3, we state 
a few assumptions on the data determined segmentation r^"^ which would 
enable us to act as if the segmentation were known and state Theorem 4.5 
to that effect. 

4.2.2. Consistency with true segmentation. Under the hypothetical sit- 
uation where the segmentation t assumed in Algorithm 4.3 is equal to the 
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set of true changepoints, then the algorithm can be easily shown to be con- 
sistent. Here we state the result, which will be proved in the supplemental 
article [Bickel et al. (2010)]. 

First, we state a stronger version of the assumption Bl, which requires 
that the square of the subsample size L = L„ be o(n): 

A8. Ll/n-^0. 

Then, the consistency of Algorithm 4.3 given the true segmentation follows 
from the following theorem. 

Theorem 4.4. // assumptions A1-A8 hold, then 
(4.5) Li/'j:n~'/'[Tl^irn)-T^]^N{0,I) 
in probability, where I is the dx d identity. 

4.2.3. Consistency with estimated segmentation. Let 



be a segmentation of the sequence Xi , , which is determined from the 
data, and which is intended to estimate the true changepoints r = t^"\ We 
will state conditions on t such that the statistic obtained from Algorithm 4.3 
based on r is close to the statistic obtained from the same algorithm based 
on the true segmentation r. This can be stated formally as follows. For any 
segmentation t, let X*(t) be a subsample drawn according to Algorithm 4.3 
based on t. Let F*^{-) be the distribution of \/L{r[X*(t)] - £;*r[X*(t)]} 
conditioned on and t. Then, the desired property on the esti- 

mated segmentation r is that 

(4.6) pli[Fl^;i.(n) , K,Ti")^ 0' as n ^ oo, 

where is the Mallows' metric described in Section 4.1.1. That is, for 
inferential purposes, r[X*(T)] has approximately the same distribution as 
r[X*(T)]. Then, since we have shown in Section 4.2.2 that 

p1,[f;^,„,,«&(s„)] ^pO, 

where $(Sn) is the Gaussian distribution with mean and variance E„, 
(4.6) implies that 

^/L^J:-^{T[x*{f^''^)] - ^*r[x*(t)]} ^ iv(o, /). 

Let hi = f^^\ — f^^"^ . We now state conditions on the estimated segmen- 
tation which guarantee (4.6): 

A9. Un/n^O, 
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AlO. n ^ ^..^^^^j, — > for all A; < CO, 
All. Lnu-^ Ylf^i mini<j<{/„ In - fj\ 0. 

Assumptions A9 and AlO for r^"^ mirror assumptions A3 and A4 for r^"'^ 
Assumption All is a consistency criterion: As the data set grows, the total 
discrepancy in the estimation of t^") by t^""^ must be small. 

Theorem 4.5. Under assumptions Al-All, (4.6) holds. 

The proof is given in the supplemental article [Bickel et al. (2010)]. There 
are trivial extensions of this theorem to smooth functions of vector means, 
which are, in fact, needed but simply cloud the exposition. 

Theorem 4.5 implies that confidence intervals based on subsamples 

{X*'^(f(")):i = l,...,i?} 

constructed by Algorithm 4.3 conditional on t^^'^ are consistent, as long 
as T^"^ satisfies A9-A11. Here is the formal statement of this fact in the 
one-dimensional case, where o"^ replaces and g is the identity. 

Corollary 4.6. Under assumptions Al-All; 

1. Let fj^ be the block subsampling estimate of variance defined in (4.4), 
then 

2. Confidence intervals estimated by Efron's percentile method are consis- 
tent. That is, 

Pi^([na/2]) <l^< ^(*[„(l-a/2)])) "^p 1 " «• 

4.3. Segmentation methods. The objective of the segmentation step is 
to divide the original data sequence Xi , . . . , Xn into approximately homoge- 
neous regions so that the variance estimated in Algorithm 4.3 approximates 
the true variance of T„. A segmentation into regions of constant mean is 
sufficient for guaranteeing that Algorithm 4.3 gives consistent variance es- 
timates. Therefore, we focus here on the segmentation of X into regions of 
constant mean. 

In our simulation and data analysis, we use the dyadic segmentation ap- 
proach, which we motivate and describe here using the simple case where g 
is the identity function. First consider the simple case where Xi , Xn are 
independent with variance 1. In testing the null hypothesis 



Ho:E[Xi] = fi, 
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versus the alternative Ha that there exists 1 < r < n such that = /ii 

for i <T and =^27^^! for i>T, one can show that the following is 

the generalized likelihood ratio test: 



Reject Hq if max nM{j) > c, 

l<j<n 



where 



(4.7) M(i) = ^{Xi.,j - Xv.nf + ^(X,-+i:„ - Xi;„,)2. 

n n 

The maximum likelihood estimate of the changepoint parameter r is the 
value that maximizes M{j). 

Our proof of Theorem 4.5 in the supplemental article [Bickel et al. (2010)] 
shows that, in the case where there is one true change in mean at r, the 
increase in the variance estimated by block subsampling with block length 
L, given no segmentation [i.e., t^") = {0,n}] over the variance estimated by 
Algorithm 4.3 conditioned on a change-point at r, is LM{t) + Op(l). Sub- 
sampling conditioned on any segmentation t 7^ r would inflate the variance 
estimate. Hence, segmenting at f = argmax^ M(j) is optimal in the sense 
that f is the changepoint estimate that minimizes the asymptotic error of 
the block subsampling variance estimate. This fact does not require the 
assumption of independence observations, and is true for any second order 
stationary sequence. Thus, if we knew that there were only one changepoint, 
and if the goal of the segmentation is to obtain the best stratified variance es- 
timate, then the best place to segment is f. The block subsampling variance 
estimate, given the segmentation {0,t,n}, would be 

t-tL/n 

^2 



^ ^ i=l 

(s n—(n—t)L/n 
2~ ) E {^i:i+{n-t)L/n — ^t+l:n) 



i=t+l 

The dyadic segmentation procedure recursively applies the above logic, 
as described below. 



Algorithm 4.7. Fix minimum region length < < n and threshold 
6 > 0. Initialize t = {to = 0, ti =n}. Repeat: 

1. For i = 1, . . . , |t| — 1, let M^^\j) and V^^\j) be respectively the pro- 
cesses (4.7) and (4.8) computed on the subsequence Xt^_j^+i, . . . , Xt-. If 
U-U.i > 2Ls, then let t', = argmax,^_^+i^<j.<i^_i^ M»(j), Bi = M^^{t[) 
and Vi = V^^{t[). Otherwise, let Bi = Q,Vi = 00. 
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2. Let Xi = L(ti — ti-i)/n, and 

If Ji = 0, stop and return t. 

3. Let i* = argmaxj Bi, and t'^'^" = t . 

4. Let t = t Ut"™, reordered so that ti is monotonically increasing in i. 

Each step of the recursion in Algorithm 4.7 proceeds as follows: In step 
1, M^^\j), the generalized likelihood ratio process, and V^'\j), the block 
subsampling variance process, are computed for each segment + of 
the current segmentation. For each segment i, Bi is the maximum squared 
difference in mean for segment i, t'^ is the changepoint estimate that achieves 
this maximum, and XiVi is the estimate of variance given a changepoint at 
t'-. In computing Bi and Vi we do not allow break points that create a region 
with length less than Lg. In step 2, we normalize the statistic (ti — ti^i)Bi by 

the estimated standard deviation vA^V^- If this normalized statistic is below 
the threshold b for every subsegment, then the recursion stops and returns 
the current segmentation. Otherwise, in step 3, the optimal changepoint over 
all regions is chosen to be the cut that maximizes the decrease in error 

of the block subsampling variance estimate. In step 4, this new changepoint 
^(new) -g added to the current segmentation t. 

The computation of Vi in step 2 requires an appropriate choice L = Li, 
of the block subsampling sample size. If the correct segmentation is known, 
then the choice of is easier, as described in Section 4.6. When the seg- 
mentation is not known, but a ball park value of is available, then a 
segmentation can be computed using the ballpark value. The segmentation 
can then be used to obtain a better choice of Lf^. If a ball park value of 
is not available, then the normalization by Vi can be omitted, in which case 
the parameter b in step 3 should be set to 0. This would be equivalent to 
stopping the segmentation only when the next optimal cut will violate the 
minimum region length Lg. In the examples of Section 5.1 we set 6 = 0, thus 
decoupling the choice of Ls from that of Lf,. 

Two more parameters required by Algorithm 4.7 are Lg and b. The choice 
for Lg is discussed in Section 4.5. The choice of b can be guided by the 
fact that, under the null hypothesis, if L were chosen appropriately, then 
(ti — ti_i)M(*)(j)/[y(*)(j)Aj]^/^ is a pivot with approximate distribution Xi- 
Asymptotic approximations for the family-wise error rate have been derived 
in the case of independent sequences [James et al. (1987)]. In the case of 
dependent sequences a Bonferroni adjustment can be applied to adjust for 
multiple testing. We also used the formulas given in James et al. (1987) to 
get a crude cutoff, which seems to work in practice. 
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Algorithm 4.7 belongs to the class of dyadic segmentation algorithms 
for detection of changepoints, the consistency of which were studied by 
Vostrikova (1981). These algorithms are greedy procedures that avoid the 
search over all possible segmentations. They have been applied successfully 
to various settings in biology, including segmentation of GC content [Li et 
al. (2002)] and the analysis of DNA copy number data [Olshen et al. (2004)]. 

The consistency of Algorithm 4.3 requires conditions A9-A11 to be satis- 
fied by the estimated segmentation. The key condition is All which defines 
a consistency criterion on the segmentation. Consistency of dyadic segmen- 
tation has been proved in Vostrikova (1981) for sequences that satisfy the 
following conditions: 

1. Let et = Xt — E[Xt], then ||et|p is a submartingale and E'HetlP < ct^ , 
OO, /3 < 2. 

2. The number of regions is fixed and the region sizes are of order n, that 
is, 

Tn = {nri,...,nru), < n < • • • < r;/. 

It is easy to verify that condition 1 is satisfied by the piecewise station- 
ary model due to the mixing condition A2. Condition 2 is more stringent 
than our assumptions A3 and A4, under which C/„ is allowed to increase 
with n. The consistency of dyadic segmentation for the case of — )• oo 
has been explored in Venkatraman (1992), who gave asymptotic conditions 
on the rejection threshold and on the sizes of the regions to ensure consis- 
tency under the assumption of an independent Gaussian sequence. However, 
these conditions are hard to verify in practice, and for many applications 
in genomics the more stringent condition of Vostrikova (1981) is sufficient. 
Previous studies on segmenting the genome based on features such as the 
GC content [Fu and Curnow (1990); Li et al. (2002)] have used this finite 
regions assumption to achieve reasonable results. 

The dyadic segmentation procedure uses information from the entire se- 
quence to call the first change, and then recursively uses all of the informa- 
tion from each subsegment to call each successive change in that segment. 
An alternative is to use pseudo-sequential procedures, which are sequential 
(online) schemes that have been adapted for changepoint detection when 
the entire sequence of a fixed length is completely observed. The basic idea 
of this class of methods is to do a directional scan starting at one end of 
the sequence. Every time a changepoint is called, the observations prior to 
the changepoint are ignored and the process starts over to look for the next 
change after the previously detected changepoint. Specifically, let tq = and, 
given fi,...,ffc, 

ffc+i = inf {/ > ffc : 5 {Xf^ , Xf^^^ , . . . , J > 6} , 
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where S* is a suitably defined changepoint statistic and 6 is a pre-chosen 
boundary. The estimates from pseudo-sequential schemes depend on the di- 
rection in which the data is scanned. Thus, while they may be suitable for, 
say, timeseries data, they may not be natural for segmentation of genomic 
data, which in most cases do not have an obvious directionality. The con- 
sistency of pseudo-sequential procedures has been studied by Venkatraman 
(1992), who gave conditions on b — bn and t^'^^ for consistency of t^'^^ under 
the setting that Xi are independent Gaussian with changing means. 

4.4. Testing the null hypothesis of no associations. Here we extend the 
results in Section 4.2 to testing the null hypothesis of no association using 
nonlinear statistics. As we discussed in Section 1.2, the inference problem 
typically posed in high-throughput genomics is that of association of two 
features. In terms of our framework we have two 0-1 processes {Ik}k=i,...,n 
and {Jk}k=i,...,n both defined on a segment of length n of the genome. We 
assume that the joint process {Ik,Jk} is piecewise stationary and mixing 
and want to test the hypothesis that the two point processes {Ik}k=i,...,n 
and {Jk}k=i,...,n are independent. We have studied two fairly natural test 
statistics in ENCODE, the "percent basepair overlap," 



and the "regional overlap," Rn, which we define in Section 3, with large 
values of these statistics indicating dependence. The major problem we face 
in constructing a test is what critical values i "^na ^e should specify so 
that 



where Hq is the hypothesis that the vectors (/i, . . . , and ( Ji, . . . , Jn)^ 
are independent, and the corresponding r„Q for 

We aim for statistics based on Bn, Rn (respectively) which are asymp- 
totically Gaussian with mean under Hq. In general, we have to be careful 
about our definition of independence. If we interpret Hq as we stated, simply 
as independence of the vectors (/i, . . . , In)'^ and ( Ji, . . . , Jn)'^ , then 



where lik and Jik refer to the kth basepair in the ith segment and, hence, 
we have 



Bn = 



Y.k=ihJk 
Y2=i h 



(4.9) 




(4.10) 



E.f=iA..<(/)<(J) 
Ef=iA.<(/) 
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The natural estimate of this expectation is then 

u 



i=l 



where Aj = — , /j is the average of Jj^, Jj is the average of Jj^, and / is the 
grand average. We assume the correct segmentation. 

We proceed with construction of a test statistic and estimation of the null 
distribution. In view of (4.10), our test statistic based on Bn is 

(4.11) TO=n^/\Bn-Jn), 
where 

u 



Jn= [^XiiiJi] I -^h, 

(4.12) 



vj=l / ' k=l 



where Aj = Aj(t), Jj = ^(t) 4 

with Jj similarly defined. Here is the algorithm based on this statistic. 

Algorithm 4.8. In order to estimate the null distribution, we do the 
following: 

1. Pick at random without replacement two starting points, Ki and K2, of 
blocks of length L from {1, . . . , n — L}. 

2. Let {Iki+i,---,Ikt_+lV and ( Jr-^+i, . . . , Jr-^+l)^, {Ik2+i, ■ ■ ■ , Ik2+lV 
and (Jx2+i5 ■ ■ ■ , Jk2+l)^ be the two sets of two feature indicators. 

3. Form 



1=1 
L 



L 

1=1 



=:*2 



IJnL = j'^Ik2+iJki+1 



1=1 

and define /*^, J*\^, J*f^ analogously. Let 

^riL — 9 \ f*l "I" 7*2 

rp^ 771* T* 
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where 



and is defined analogously. Let F*^f^, IJ*nLb^ ^tc., be obtained by 
choosing (-RTib, i^2&); 6 = 1, • • • , -B, independently as usual. 

4. We use the following Cnia as a critical value for at level a, 

Cn.La — <^n + I ^ I ^nL{B(l-a))^ 

where T^^i-^ <■■■ < 'FnL{B) ordered T*j^j^ and [•] denotes integer 

part and Jn = ^ Ylk=i Jk- 

5. If the sequence is piecewise stationary with estimated segments j = 1, . . . ,Un 

(i) 

as in Section 4.3, we draw independently B sets of starting points, , . . . , 

(?) (?) (7) 

and ' • • • '-^25' °^ blocks of length XjL from each segment i = 

when each pair is drawn at random without replacement. Here 

Ylf=i -^i = 1 and Xi is proportional to the length of estimated segment i. 

Then piece T*^^ together as follows. Let 



I J nLb = ^ ^iKii+l JiK2b+l ' 



J* It 



Then, 



where 



1 ^ 
etc., 

i=l ^ nLb ^nLb / 



^y-i* 771* T* 

-'nLb ~ ^nLb ~ '^riLbi 



7* _ 'I2i=li^nLb)i'^nLb)'^i 
-^nLb - 



J2i=l{^nLb)^i 

with /:1b = + i:?^. The critical value is 
as before. 
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We can apply this principle more generally to statistics which are func- 
tions of sums of products of /'s and J's evaluated at the same positions. 

The proof of the following theorem is given in the supplemental article 
[Bickel et al. (2010)]. 

Theorem 4.9. If Co, Pq denote distributions under the hypothesis of 
independence and Al-All hold, then 



3. Po[T^ > (f ^ a where <fi-a the [(1 - a)B]th of T^^^, 1 < 



In practice, this definition of independence makes our statistic in effect 
reflect conditional independence of Ik and Jk given the segment to which 
the kth base belongs. This can be unsatisfactory in practice, for instance, 
when the features are concentrated in small segments such that large, sparse 
segments swamp the inference. 

We define independence irrespective of segment identity as saying that 
the average over all permutations of the segments of the joint distribu- 
tion of the point process features are independent. Formally, if (Pi, . . . , Pu), 
(Qi, . . . , Qu) denote the marginal distributions of {{lik :k = 1, . . . , n^} : i = 
1,...,U} and {{Jik-k = l,...,ni}:i = l,...,U}, and {Ri,...,Rn) corre- 
spond to the joint distribution of {{hkiJik) '-^l^ik <n}, then let (Pi, . . . , Pjj) = 
jjy {PttI, • • • 5 Pttu) where vr ranges over all permutations of 1, . . . , ?7. Define 
{Qi, . . . , Qu) and (Pi, . . . , Rfj) similarly. Then, our hypothesis is 



This is simply saying that independence is not conditional on relative ge- 
nomic position of segments. 

It is easy to see that we should now define 




b<B. 



(4.13) 



Hi:R = PxQ. 



(4.14) 



n 



= n 



where Jn= nYli=iJi- 



The reason for this is that 



(4.15) 




Under Hi 
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and 

E^{I) = Ep{I), 

so that the statistic simphfies to the U = 1 form, as above. 

It is clear that the conclusion of (4.9) continues to hold when applied to 

. Note that the form of the bootstrap is unchanged, since is invariant 
under permutation of the segments. 

We now turn to i2„ as defined in Section 3. We assume that Vi:i = l,...,K 
are strongly mixing and stationary. If we assume Hq , we have no closed form 
for Effgl^'^^^^Vk) by which to center Rn- To estimate this quantity, we 
apply a version of the double bootstrap [Beran (1988); Hall (1992); Letson 
and McCuUough (1998)]. 

Consider ^ X^^i under Hi. We draw Bi pairs of large blocks of length 
niL, and we compute the % false region overlap, call it Rl,b = 1, . . . , B, in 
each pair of "large" blocks, where mL is still negligible compared to segment 
size, but m — >• oo. Define 

1 

(4.16) EHARn) = ^^Rl 

6=1 

and 

(4.17) fi«) = ni/2(i2„-^^^(i2„)). 

Note that we again want to consider independence irrespective of segment 
identity, so that above are computed without any segmentation beyond 
the natural segmentation, for example, chromosomes. Now compute the em- 
pirical distribution of r„ using the size L segmented block subsampling 

— ( R) 

and proceed as usual. We can define Tn corresponding to Hq in the same 
way, though we now have to cut up our mL blocks in proportion to segment 
sizes to center. We do not pursue this since the Hi hypothesis gives stable 
results while Hq does not. 

We have not proved a result justifying the use of the double bootstrap in 
this way, but simulations suggest that it behaves as expected; see Section 5.3. 

4.5. Choice of segment size Lg. Two tuning parameters appear in our 
procedure in addition to b appearing in the segmentation scheme. Lg is the 
smallest allowed size of a "stationary" piece after segmentation. It essentially 
determines the scale of the segmentation, which we view as an application 
context dependent quantity that users need to control. The reason is that 
stationarity is a matter of scale. To put it concretely, consider the situations 
where I^, k = 1, . . . ,n, are simply the base pair nucleotides A,C,G,T and 
consider the scale of a large gene of length n. Then, it seems natural that the 
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exons and introns correspond to consecutive stationary regimes. However, 
suppose we now move our scale to a gene rich genomic region of length N. 
Now, it is the genes themselves and the intergenic regions which correspond 
to an at least initial segmentation. 

This dependence of segmentation on scale has a natural intuitive conse- 
quence. Consider a statistic such as base pair overlap of two features. As one 
increases the region size n in which one wishes to declare significant overlap, 
the standard deviation of the statistic, which is 0(n~^/^), decreases, and 
p-values decrease. However, if, as one would expect, the region over which 
n increases becomes homogeneous on a larger scale, coarser segmentation 
would then be called for. This, as we have noted, necessarily increases the 
standard deviation of the statistic, and from that point of view significance 
becomes more difficult to achieve. 

Put another way, it is not impossible to think of the whole genome itself 
as being stationary on a large scale, but that we can hierarchically segment 
the genome in many ways so that each large subsegment is stationary, but 
the segments are not identically distributed, even where they are of equal 
length. For instance, a natural initial segmentation is to chromosomes. 

Finally, we argue in mathematical terms going the other way from in- 
homogeneity to homogeneity. Start with a sequence of independent (say) 
Bernoulli variables Xi,X2, ■ ■ ■ ,Xn, with being Bernoulli (pfc)- If the pk 
are arbitrary, the only segmentation we can perform is the useless trivial 
one, where each X^ is its own segment. But, now we tell ourselves that pk, 
1 <k < n/2, are drawn i.i.d. from C/(0, 1/2) and for n/2 + 1 < A: < n from 
C/(l/2, 1), we suddenly just have two segments to consider. 

Thus, Lg in our view needs to be treated as the smallest scale on which 
homogeneity is expected. Note that these considerations are not limited to 
testing. They also govern confidence intervals, as discussed in Section 4.2.3. 

4.6. Choice of Lb, the subsample size. We believe that the best way to 
choose Lh, after segmentation has been estimated, is so that the resulting 
subsampling distribution of the statistics is as stable as possible and Lg is 
large but <^ n. We also formally consider Gaussianity of the distribution 
and, if possible, maximizing that feature as well. This does not necessarily 
mean segment more — since AlO and All may then fail. We advocate but 
do not analyze further the following proposal put forward in m-out-of-n 
subsampling by Bickel, Gotze and van Zwet (1997) and analyzed in detail 
by Gotze and Rackauskas (2001) and Bickel and Sakov (2008): 

1. Let X*{L) be the statistic computed from the sample drawn with blocks 
of length L. Compute the block subsampling distribution jCi^^ for the 



statistic 
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and = p^n, where p <1 and v = 1,2, . . . ,V . Note that these Ly provide 
candidate choices of the subsample size Lj,. 

2. Compute a "distance" d*{v) between Cl^ and Cl^_j^. 

3. Choose Lh = Ly* , where Vq = argmin(i*(?;). 

In practice, we use for d* (v) the pseudometric 



^IQRiCLj-IQR{CL^.,) , 

'V 

where IQR{C) is the interquartile range of C 

In continuing work with Gotze and van Zwet, we are in the process of 
trying to show that, under mild conditions, as n — oo we have Lf, — )• oo, 
Lft/n — 7- 0. More significantly, we expect that in a fashion analogous to 
Gotze and Rackauskas (2001) and Bickel and Sakov (2008), under restric- 
tive conditions and for suitable choice of distance, L;, yields an estimate 
which is as good as possible in the following sense: If is the actual dis- 
tribution of y/n{Xn — p), d{m) is the distance between and Cl^, and 
vq = argmin^ d{v), then 

divo) 

Thus, Ly* = p^on yields performance of the same order as p'"°n. 



5. Simulation and data studies. 



5.1. Simulation study L In this section we perform a simple simulation 
study to demonstrate the power of our block-subsampling method in the 
situation where features are naturally clustered. We simulated a binary se- 
quence xi,. . . ,Xn with n = 10,000 by the following Markovian model: 



1) = ? + (!- Po)5£^ forfc = 2,...,n, 
2 w 

where w is the order of the Markov model or, intuitively, the size of the de- 
pendency window, and pQ indicates the level of dependency. Smaller pQ gives 
stronger dependence between neighboring positions. We define the following 
two types of features at position k in the sequence: 

• Feature I: the occurrence of sequence 11,100 starting at position k. 

• Feature II: the occurrence of more than six I's in the next 10 consecutive 
positions including the current position k. 



(5.1) 



Pixi 



P{Xk 



SUBSAMPLING METHODS FOR GENOMIC INFERENCE 



27 



From model (5.1), the feature II will occur in clusters in the sequence. The 
overlap between the two types of features can be measured by the statistic 

g _ Z]fc=l ^kJk 

with Jfc, Jfc being binary and indicating the occurrences of sites of types I 
and II, respectively. 

Figure 1 shows the distribution of S estimated through different ways: 

• The true distribution is the empirical distribution of estimated S from 
10,000 random sequences generated under model (5.1). 

• The Ordinary Bootstrap distribution is derived by performing a base- 
by-base uniform sampling of the sequence xi, . . . ,Xn to construct 10,000 
sequences of length n. 

• The Feature Randomization distribution is derived by keeping features of 
type I fixed and randomizing uniformly the start positions of the features 
of type II to construct 10,000 sequences of length n. 

• The block subsampling distribution is derived by drawing independent 
samples of blocks of length L = 40 and stringing the blocks together to 
construct 10,000 sequences of length n. 

From Figure 1, we see that block subsampling produces more reliable 
estimates of the variance of S compared to the naive methods: ordinary 
bootstrapping and feature randomization. Both naive methods ignore the 
dependence between positions and thus fail to take into account the natural 
clumps of the feature II. This is the key reason for the poor performance of 
the two naive methods. 

5.2. Simulation study Ila. Our second simulation study examines the 
case where the sequence is generated from a piecewise stationary model 
where there is more than one homogeneous region. As before, we consider 
the problem of estimating the percentage of base pair overlap between two 
features, and compare the performance of four strategies: 

1. feature randomization, 

2. naive block subsampling from unsegmented sequence, 

3. block subsampling from sequence segmented using the true changepoints, 
and 

4. block subsampling from sequence segmented using the changepoints es- 
timated by the dyadic segmentation method we described in Section 4.3. 

In our simulation model, we generate Xt, Yt independently from a Neyman- 
Scott process characterized as follows: 

1. Cluster centers occur along the sequence according to a Poisson process 
of rate Aj in region i. 



28 



P. J. BICKEL ET AL. 
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Fig. 1. Comparison of different subsampUng schemes. 



2. The number of features in each cluster fohows Poisson distribution with 
mean a. 

3. The start of features are located at a geometric distance (mean //) from 
the cluster center. 

4. The features are generated with length that is geometric with mean /?. 

5. Overlap between features generated using steps 1-4 are ignored. 

For simplicity, we let there be only 2 homogeneous regions, each of length 
T = 10,000. Consider the setting where the parameters for the two regions 
have the following values: (Ai, ai, ^i, /3i) = (0.01, 10, 10, 5) and (A2, a2, /i25 /32) = 
(0.02,10,10,5). Figure 2 shows a simulated example, where features A and 
B are plotted as well as their overlap. Figure 2 also shows the cumulative 
sum and the segmentation. Figure 3 shows respectively the histograms of 
the estimated distribution of the overlap statistic X* centered and scaled. 
It is clear that the feature randomization underestimates the standard devi- 
ation, whereas naive block subsampling without segmentation gives a mix- 
ture distribution with long tails. Strategy 3, which subsamples assuming the 
true changepoint at r is known, gives the correct distribution as expected. 
Strategy 4, which uses the estimated changepoint, reassurringly gives a very 
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Fig. 2. Example of one instance from simulation model 2. Top plot shows cumulative 
sum and estimated segmentation. 



similar distribution to strategy 3. Table 1 gives the standard deviation esti- 
mates. 

5.3. Simulation study lib. We utilized the Neyman-Scott process de- 
scribed in simulation study Ila to study the consistency of the double boot- 
strap method described in Section 4.4 for estimating the distribution of 
Rn- We consider the simple case where there is one homogeneous region. 
We utilized a larger region and a parameterization of the process that re- 
sults in more and longer feature instances than we consider in the study 
above. The parameters are T = 5 Mb and (Ai, ai, /ii, = (A2, a2, /i2i /52) = 
(0.05,10,100,75). This yields a pair of feature-sets with around 5000 in- 
stances, where each feature-set covers around 17% of the 5 Mb region. We 
simulated 20,000 pairs of feature-sets from this process, and found that the 
mean of region-overlap between pairs, Rn, was 0.293, and the standard error 
was 0.0072. We subsampled 1000 sets of 10,000 draws from this distribution, 
each of which yielded the mean above (to 3 significant digits), and the stan- 
dard errors ranged from 0.0071 to 0.0073, which corresponds almost exactly 
to the theoretical 95% confidence interval for the standard error of the stan- 
dard error of a Gaussian with standard deviation 0.0072 after 10,000 draws. 
Not surprisingly, the distribution of Rn was Gaussian, as indicated by the 
Lilliefors and the Shapiro-Wilk test, which did not reject the hypothesis 
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of Gaussianity at a significance level 0.05 with the full sample of 20,000 
observations. 

In order to test the capacity of segmented subsampling with a version 
of the double block bootstrap to discover this distribution based on only a 
single pair of observations, we selected the most extreme pair found during 
simulation, for which i2„ was 0.321, corresponding to a z-score of 3.87. Since 
the number of feature instances is itself a random quantity, the job of block 
subsampling is particularly difficult: when ii„ is far to the right of expec- 
tation, the feature-sets tend to contain more feature instances than those 
closer to the center. The pair we chose was no exception. The results are 
given in Figure 4. Hence, it is not surprising that our subsampling procedure 
tends to over-estimate the mean. The Lilliefors test fails to reject the Gaus- 
sianity of any of the resulting distributions with sample sizes up to 1000 at 
a significance level of 0.05, but does reject it for several of the smaller block- 
sizes when the sample size is pushed up to 10,000. The Shapiro-Wilk test, 
however, detects departures from Gaussianity for many of the distributions 
at a significance level of 0.05 for samples larger than 500. This is because 
Rn is predicated on relatively small counts of feature-instance overlaps and, 
hence, the distributions tend to have heavy tails. 




We note that the global minimum of the Inter-Quantile (IQ) statistic was 
found at U/Lr = 0.15 and L^/n = 0.06. That is, 0.9% of the 5 Mb region, 
or 45 Kb, were included in each block sample. This block sample size is 
certainly sufficient to capture multiple feature-clusters, since the parameter- 
ized Neyman-Scott process above yields an average inter-cluster distance of 
about 1 Kb. 

To corroborate our hypothesis that the mean was overestimated because 
the feature-sets we chose were more dense than most, we applied our method 
with learned parametrization, L^/L^ = 0.15 and L^/n = 0.06, for a pair of 



Table 1 

Estimates of standard error by four sampling strategies in simulation study Ha 
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Subsample, true segmentation 
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Subsample, estimated segmentation 


l.Oe-002 
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feature-sets with i?„ = 0.293, the population average. Indeed, the mean was 
estimated, after 10,000 samples, to be 0.293, and o-„ was 0.0072. 

Although the purpose of this simulation was merely to check the con- 
sistency of our version of the double block bootstrap for data not unlike 
actual genomic data, for example, ChlP-seq "broad-peaks," we decided to 
also check the performance of feature-start site shuffling for the same pair 
of feature-sets used above. In the case of Bn, the basepair overlap statis- 
tic, feature start-site shuffling correctly estimates the mean, but can (in the 
stationary case) radically underestimate the standard deviation. The same 
is not true in the case of Rn- Start-site shuffling is not assured (under our 
model) to provide an unbiased estimate of the mean or the standard devi- 
ation. We drew 10,000 samples from the distribution under shuffling, and 
found the mean to be 0.337, and the standard deviation to be 0.0070, which 
indicates that the pair of feature-sets under study in fact overlap slightly 
less than expected at random (p~ 0.011). The fact that this conclusion is 
actually in the wrong direction in this relatively easy, stationary example 
should make us skeptical of studies that rely upon start-site shuffling to draw 
conclusions about statistics that cannot be defined locally, such as Rn- 

Our discussion of this simulation and the following real data examples 
exhibit the subtleties inherent in our approach. Subtleties appear whenever 
inference follows regularization. 

5.4. Association of noncoding ENCODE annotations and constrained se- 
quences. Here we present a real example of the study of association between 
"constrained sequences" and "nonexonic annotations" from the ENCODE 
project, hmited to the 1.87 Mbp ENCODE Pilot Region ENmOOl, also 
known as the CFTR locus. The constrained sequences are those highly con- 
served between human and the 14 mammalian species studied and sequenced 
by the ENCODE consortium. Enrichment of evolutionary constraint at the 
"nonexonic annotations" sites implies that the biochemical assays employed 
by the ENCODE consortium are capable of identifying biologically func- 
tional elements. We tested the association of noncoding annotations and 
constrained elements using the base pair overlap statistic in Section 4.3 
using the conditional formulation. We interpret the lack of association as, 
given sequence composition and the distribution of each feature along the 
genome as observed, the assignments (by nature) of features A and B to 
individual bases are made independently. We derive the significance of the 
observed statistic under this null hypothesis following the method proposed 
in Section 4.3. 

As we discussed, we have several issues to deal with: 

(i) How do we segment? That is, what statistic(s) do we use for segmena- 
tion? 
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(ii) Is segmentation necessary or is the region sufficiently homogeneous? 

(iii) If we segment, what should we use? 

(iv) Given a segmentation, what Lf, is appropriate? 

Here are our methods: 

(a) The simplest choice for (i) and the one we followed was to segment 
according to both numerator and denominator in intersect parti- 
tions and enforce an Ls bound. Given our theory, this should ensure 
homogeneity in the mean of Bn- 

(b) Although strictly speaking (ii) and (iii) can be combined, we experi- 
mented a bit to also see if the theory of Section 4.1 was borne out in 
practice. 

(c) We did not use the V statistic and thus only had to choose Lg. Again, we 
experimented with Ls = 500 Kb to preserve as much genomic structure 
as possible, and Ls = 200 Kb to ensure we had not under segmented. 

(d) We explored a variety of values of Lb, and studied the consistency be- 
tween nearby values under the interquartile statistic (IQ statistic) dis- 
cussed in Section 4.6. We draw conclusions based on the value of Li, that 
optimizes local consistency. 

To segment the data, we applied the method in Section 4.3 to both fea- 
tures A and B, or in the language of Section 4, / and J, and then combined 
the segmentation. In segmenting each feature, we experimented with mini- 
mum segment lengths Lg of 200 and 500 Kb. Before subsampling, we com- 
bined the segmentations of A and B by taking a union of the changepoints. 
This created regions with length less than Ls- However, the total length of 
these regions comprise <0.1% of the total Encode region, and were left out 
of the remaining analyses. 

If the sequence were sufficiently homogeneous, we could forgo the ini- 
tial segmentation step. Figure 5 shows an estimate of variance of Bn (with 
the appropriate renormalization) for a reasonable range of L^, both before 
and after segmentation. Two trends are clearly evident. First, segmentation 
greatly reduces the estimated variance. As we discussed in Section 4.1.2, 
inhomogeneity of the sequence causes an inflated estimate of variance. If the 
data were homogeneous, segmentation should not change the variance esti- 
mate. Thus, the fact that the estimated variances drop after segmentation for 
such a large range of L^'s suggests that the data are inhomogeneous. Second, 
and more importantly, the estimated variance of Bn increases sharply with 
increasing in the unsegmented data. This is evidence of inhomogeneity 
in the mean of Bn across this ENCODE region: underlying shifts in mean, 
if ignored, can be mistaken for spurious long range autocorrelation, which 
also implicitly runs against our assumption. In either Theorem 4.2 

suggests, we would be overly conservative. Thus, a preliminary exploration 
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of the data convinces us that this ENCODE region is inhomogeneous in / 
and/or J and segmentation is necessary. 

We found that 200 and 500 Kb gave 5 and 3 segments respectively. Fig- 
ure 6 gives the results for 500 Kb. What is fairly surprising, but reassuring, 
is that over the whole broad range of L;, considered, the estimated SD of 
the statistic under the null was essentially flat after segmentation. Flat here 
means that variability was within a Monte Carlo SD for the 10,000 repli- 
cations we used. We would expect longer values of to include, in our 
estimate of a, additional covariance between distant genomic positions cap- 
tured by the extended block-length. The fact that this, by and large, does not 
appear to be happening is consistent with our hypothesis that the relevant 
mixing distance is indeed quite small compared to the size of approximately 
stationary regimes. 

We found that there is still moderate deviation from Gaussianity in both 
the segmented and unsegmented case for 0.05 < Lf, < 0.25, both in the tails, 
as detected by the Shapiro~Wilk test, and in the body of the distribution 
under the Lilliefors test. With a sample size of 100, neither test detects this 
departure, but at a sample size of only 500, it is detected under a number 
of parameterizations of L^. As we discussed in Section 4.5, the definition 
of stationarity depends on the scale at which we view the genome. This 
suggests that our segmentation still does not take care of inhomogenity in 
the variance. Hence, as we have mentioned, if we use the variance for the 
Gaussian approximation, our results are still conservative. 
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Fig. 6. Comparison of block subsampling distributions, p^n vs. p^'^^n under the IQR 
statistic. Estimates a„ and resulting z-scores of Bn shown. 

The scientific conclusion of this example is that, indeed, there is strong 
association since the z-value is over 9 SDs. We note that the effect of seg- 
mentation on our scientific conclusion is essentially nonexistent. However, it 
is comforting to note that the change in (with and without segmentation) 
variance is in the correct direction. 

5.5. The association of copy number variation with RefSeq annotated ex- 
ons in the human genome. In this example, we reanalyze a published data 
set; this reanalysis leads to a different conclusion from the one made by 
the original paper. In 2006, Redon et al. published a set of 1445 genomic 
regions with observed Copy Number Variation (CNVs) across individuals. 
These regions consist of both deletions and insertions, and more than half 
of them overlap genes. In the paper, the authors reported, among other 
things, a paucity of overlap with RefSeq genes at a significance level of 0.05. 
The statistic that they used is precisely our marginal formulation of the 
region overlap statistic but the null distribution to which they referred 
it is quite different. Their null was computed by randomly permuting both 
genes and CNVs, and hence treats the entire genome (or at least entire chro- 
mosomes) as homogeneous, and the distances between feature-instances as 
exponential. Thus, if feature-instance lengths were all 1 bp, this would be 
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a Poisson process. As discussed in Section 5.3, under our model this proce- 
dure provides an unbiased estimate of the mean in the case of the Bn , but is 
unpredictable with respect to its estimate of the variance. In the case of Rn, 
it is unpredictable with respect to both the mean and the variance. Here, 
for comparison with the result of Redon et al. (2006), we examine only . 

Although we have attempted to replicate this portion of the Redon study, 
undoubtedly there are small differences between our efforts and those of 
Redon et al. (2006). For instance, we have masked all genomic repeats in the 
"Repeat Masker" track on the UCSC genome browser (genome.ucsc.edu). 
Redon et al. also considered patterns of repeats in their analysis, but may 
have utilized an at least slightly different map of genomic repeats. We find 
that 61.8% of the CNVs overlap RefSeq genes by at least 1 basepair. That 
is, we wish to assess the significance of our observed statistic i?„ = 0.618. 

The calibration of the subsampling procedure is nontrivial, especially in 
this application where we must consider the additional parameter L^. Hence, 
in the following we provide complete detail regarding the calibration of our 
method for the data of Redon et al. (2006). 

As before, our analysis begins with an assessment of the need for seg- 
mentation. In this case, we are dealing with whole human chromosomes, we 
expect that, in general, at least some segmentation is necessary. We seg- 
mented down to a minimum segment length of 10,000,000 bps (10 Mbs), 
letting Ls = 10 Mb. The mean length of these CNVs is around 250 Kb, and 
they are not uniformly distributed, so we are compelled not to segment down 
to regions much smaller than 10 Mb by our desire to capture the appropriate 
spatial distribution of clusters of feature-instances. To assess the sufficiency 
of the resulting segmentation, we examine the Gaussianity of the segmented 
subsampling distributions. This examination is tied to our selection of block 
length. 

To select an inner block length, L^, and an outer block- length, L^, we 
drew 10,000 samples for each of several lengths. We chose to use a linear, 
rather than exponential, scale for L^/n: we selected 10 values from 0.01 to 
0.10 in increments of 0.01. We chose three values of L^/Lr, 0.05, 0.10 and 
0.20. Each of these parameterizations yields several responses, including: an 
estimated z-score, d*{k), and measures of Gaussianity. In Figure 7, we plot 
the relationship between the estimated z-score, d*{k), Lr and Lf,. Regarding 
the Gaussianity of the resulting distributions, at a significance level of 0.01 
and a sample size of 5000, neither the Shapiro-Wilk nor the Lilliefors test 
rejected the null hypothesis of Gaussianity for any of the 30 explored pa- 
rameterizations. To supplement our biological intuition that segmentation 
is necessary when whole chromosomes are considered, we used the same 
30 parameterizations with the unsegmented data, and performed the same 
tests to check the Gaussianity of the resulting distributions. Of the 30 pa- 
rameterizations, 3 showed departures from Gaussianity under Lilliefors test, 
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Fig. 7. The relationship between the estimated z-score, d*(k), Lr and Lb- As Lr in- 
creases, our estimate of (t„ (not shown) increases, which drives the estimated z-score 
down. As Lr becomes too small, we lose the stability of our estimates, and d*{k) increases. 
For the smallest value of Lr shown here, the estimated z-score increases sharply, but the 
corresponding value of d*{k) indicates that this parameterization is unreliable. The ideal 
parameterization under d*{k) is given by Lr/n = 0.09 and L^/ Lr = 0.20. 

and 9 showed strong departures in the tails under the Shapiro- Wilks test. 
This indicates, as expected, that segmentation has substantially improved 
the Gaussianity of the sample distributions. In practice, one might attempt 
a finer segmentation in hopes of further reducing the (conservative) bias in 
(T„. For this example we are satisfied with the current segmentation. 

The global minimum of d*{k) occurs for Lr/n = 0.09 and L^i/Lj. = 0.20. 
This parameterization yields an estimated z-score of 1.25 and, therefore, we 
conclude that we cannot corroborate the result of Redon et al. (2006). Un- 
der our model it appears that CNVs are, if anything, very slightly positively 
associated with genes [p ~ 0.105). We note that a few parameterizations, 
as shown in Figure 7, do produce z-scores greater than 2. However, these 
parameterizations correspond to large values of d*{k) and, furthermore, sig- 
nificance is in the opposite direction reported by Redon et al. (2006). This 
highlights the need for carefully defined null distributions in genomic studies. 
We are not suggesting that the results presented necessarily invalidate the 
corresponding result of Redon et al. (2006), but rather we caution that sci- 
entific conclusions of this kind are predicated on how the researcher defines 
"at random," and that this definition should be made to reflect, as much 
as possible, that which is known about the actual distribution of genomic 
elements. We presume that authors wish, in general, to err on the side of 
caution, and hence do not wish to report significant association when the 
association can be explained simply by a conservative choice of null. 

SUPPLEMENTARY MATERIAL 

Some theorems in subsampling methods for genomic inference (DOI: 
10.1214/10-AOAS363SUPP; .pdf). In Supplementary Material, we provide 
theoretical proofs to the theorems presented in the main text. 
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